CMMC Watch Improvements Catalog

This file catalogs the source-ingestion and resilience improvements ported from daily-trending-info, and where they are implemented in this repository.

1) Canonical Source Catalog

Implemented in: - scripts/source_catalog.py

What changed: - Centralized all collector sources into typed SourceSpec entries. - Added per-source metadata (tier, source_type, risk, parser, fallback_url). - Added reusable profile maps: - HEADER_PROFILES for source-class request headers. - DOMAIN_FETCH_PROFILES for host-specific timeout/retry tuning. - Added indexed access helpers: - get_collector_sources() - get_health_sources() - get_source_by_key() - get_source_by_source_key()

Why it matters: - Removes source drift between collector and health checks. - Makes source additions/removals explicit and reviewable.

2) Source Metadata Registry

Implemented in: - scripts/source_registry.py

What changed: - Added unified source metadata resolution for ranking and display: - get_source_metadata() - source_metadata_dict() - format_source_label() - source_quality_multiplier() - Added exact metadata generated from the source catalog and prefix fallbacks.

Why it matters: - Ranking can account for source quality/risk consistently. - UI labels now include source-quality context.

3) Hardened RSS Fetching

Implemented in: - scripts/collect_trends.py

What changed: - Added feed runtime controls: - in-memory cache - persistent cache (data/feed_runtime_cache.json) - cooldown after repeated failures - host-specific retries and retry delays - fallback URL support - Added content-type/feed-content validation before accepting responses. - Added timestamp normalization helpers for heterogeneous feed and API timestamps.

Why it matters: - Prevents repeated hard failures from taking down collection runs. - Keeps pipeline productive when sources intermittently fail.

4) Deduplication Upgrade (Clustering + Corroboration)

Implemented in: - scripts/collect_trends.py

What changed: - Replaced simple title similarity dedup with clustered dedup using: - token overlap - jaccard similarity - semantic ratio (SequenceMatcher) - Canonical trend selected by quality-aware score. - Duplicates now merge corroborating source URLs and source diversity.

Why it matters: - Reduces duplicate noise while preserving multi-source corroboration signals.

5) Quality-Aware Ranking

Implemented in: - scripts/collect_trends.py - scripts/source_registry.py

What changed: - Final scoring now combines: - base relevance score - recency boost - source-quality multiplier - source-diversity boost

Why it matters: - Promotes high-quality corroborated reporting over single-source low-signal posts.

6) Source Health Monitoring

Implemented in: - scripts/source_health_check.py

What changed: - Added parallel health checks for all cataloged sources. - Adds clear source status classifications: - healthy - degraded - flaky (intermittent/retry-success) - down - Supports fallback validation and writes data/source_health.json.

Why it matters: - Provides daily operational visibility into failing or unstable sources.

7) Test Coverage for Resilience Paths

Implemented in: - tests/test_trend_collector.py - tests/test_source_health_check.py

What changed: - Added tests for: - fallback feed usage - cached response usage during cooldown - corroboration-aware dedup behavior - quality-influenced ranking order - flaky source classification - fallback success classification - health source coverage for CMMC Reddit feeds

Why it matters: - Reduces regression risk on the most failure-prone ingestion logic.

8) Runtime Artifact Hygiene

Implemented in: - .gitignore

What changed: - Added ignore rule for feed runtime cache: - data/feed_runtime_cache.json

Why it matters: - Prevents ephemeral runtime cache files from polluting git history.

🕸️ Ada Research Browser